This project is aiming to analyse public sequencing data, specifically snRNA sequencing data. We aim to investigate whether two different genes (GFRAL and Adcyap1) are expressed in the same cells or not from an area of the brain called dorsal vagal complex (DVC), which is mainly formed by two regions, the area postrema (AP) and the nucleus of the tractus solitarii (NTS).
The GSE166649 data set from Ludwig et al. (2021) consists of 20 snRNA-seq samples, split into 50 files. Prepped with Chromium 10X 3’ v2. Some are sequenced on NextSeq and are split into four files, while other samples are sequenced on NovaSeq and split into two files. Based on the file names I assume the samples come from different batches collected during a period of about five-six months.
The GSE206144 data set from Ilanges et al. (2022) consists of six sample, of which three are cell hashing libraries (HTO) and the rest are transcriptomics libraries. Library prep used was 10X Chromium v3. They didn’t report what antibody hashtags was used for the samples, making it impossible (?) to make use of the HTO libraries for demultiplexing.
All sampels were downloaded and pre-processed using the pipeline nf-core/scrnaseq v.2.7.1 running with cellranger v 8.0.0.
After looking at the multiqc report, I could conclude that:
Most samples only had roughly half of the expected number of cells (or I misunderstood the authors when they wrote “9,000 nuclei were used for snRNA-seq and snATAC–seq“, I took it as they have used 9,000 nuclei for the snRNA-seq and 9,000 nuclei for the snATAC-seq. Perhaps they meant 9,000 nuclei in total were used for BOTH snRNA-seq and snATAC-seq. Then the number of cells are as approximately as expected for the majority of samples).
Around 80-90% of the reads are mapped, with around 30-100% mapping to exonic regions. This is expected since we’re working with single-nuclei data and typically therefore see a high percentage of reads coming from unspliced transcripts which align to introns.
The median number of genes per cell is 500-1,500, which is fine. The saturation plot suggests that some samples are under sequenced, and we might be missing some lowly expressed genes in there.
The barcode knee plot indicate high contamination of ambient RNA (according to 10X). Since there’re no clear cliffs in the plots, just removing empty empty droplets based on a hard UMI threshold probably isn’t sufficient. Using tools like CellBender, SoupX or DecontX are to prefer since they can model the data when proven difficult to distinguish empty droplets from cell-containing ones.The two samples from GSE206144 have notably better quality than the samples from GSE166649, and those two samples actually wouldn’t need ambient RNA removal but I will include them as well to be consistent.
In droplet-based single-cell RNA sequencing, some background (ambient) mRNA is always present in the solution. This mRNA can end up in the droplets along with the cells and get sequenced. As a result, some of the detected gene expression comes from the surrounding solution rather than the cell inside the droplet. SoupX was used to remove the ambient RNA. I investigated the change in expression due to this correction for our genes of interest (Fig 2.1). It shows the fraction of expression in each cell that has been deemed to be soup and removed. Please note that the figures are just examples from ONE sample, and this was done for all 22 samples. As you can see for Adcyap1, the expression in some clusters have been heavily decreased. The interpretation of cell expression can change a lot after contamination correction.
Figure 2.1: Change in expression values after ambient RNA removal for Adcyap1 (left) and Gfral (right) for one sample
To filter out doublets I used DoubletFinder DoubletFinder identifies false-negative Demuxlet classifications caused by doublets formed from cells with identical SNP profiles. DoubletFinder is sensitive to heterotypic doublets i.e., doublets formed from transcriptionally-distinct cell states, but is insensitive to homotypic doublets, i.e. doublets formed from transcriptionally-similar cell states.
After all QC and filtering there were in total 98,428 cells left.
Due to large differences between (and within) the datasets, it’s not surprising that the origin of datasets would have a large impact on clustering. To remove this influence from the cell embeddings the two datasets were integrated using Harmony. The integration mixed the datasets together to mitigate this bias.
Figure 2.2: Clustering colored by study before integration (left) and after integration (right).
Before anntation cluster was done using Seurat with resolution 0.8.
Annotation was done using gene marker list from the Ludwig et al. (2022) paper and the R-package AUCell.This method ranks genes by their expression values within each cell and constructs a response curve of the number of genes from each marker set that are present with increasing rank. It then computes the area under the curve (AUC) for each marker set, quantifying the enrichment of those markers among the most highly expressed genes in that cell. This is roughly similar to performing a Wilcoxon rank sum test between genes in and outside of the set, but involving only the top ranking genes by expression in each cell.
I retrived gene marker lists from the cell types from Ludwig et al. and annotated all sub types (non neuronal and neuronal) with AUCell.
Figure 2.3: Clustering colored by cell types
I proceeded with subsetting only the neuronal cell types and proceed with further exploring the gene expressions values in these cells. After subsetting, there was 64,756 neuronal cells left.
Figure 2.4: Clustering colored by neuronal cell types
Looking at the sample origin of the cells in each cluster we can see that most clusters consist of a mix of pretty much all samples.There are however a few potentially problematic clusters with just a few samples represented. The pristine samples from the second smaller dataset make up all cells in three clusters. They may represent rare but biologically meaningful populations and should be retained.
Figure 2.5: Sample origin cell cluster compostion
First, I used the gene marker lists for NTS and AP from Ludwig et al. to try to annotate the cells that come from AP and the ones from NTS. I reduced the lists (both containing 1000 genes) to only contain genes with log2 fold change > 1 which also overlapped in the seurat object. I was left with 766 genes for AP and 660 genes for NTS. Perhaps not surprisingly, given that most cell types were annotated as NTS by Ludwig et al., we see that most cells here are annotated as NTS as well. There are a few clusters of AP cells enriched, which fairly well corresponds to the cell types found by the authors (Glu4, GABA5 and GABA7 in particular). Since they didn’t report any marker genes for DMV that region couldn’t be predicted, but the chat-types do cluster separately as expected.
Figure 2.6: Clustering colored by brain region of interest (AP or NTS).
Overall percentage of cells expressing Gfral is 0.91 %, and Adcyap1 is 7.05 %. In total, 0.07 % of cells expressed both (around 70 cells). The cluster with the highest expression of both genes is Glu4 (1.15 %), followed by Glu5 (0.72 %).
| Cluster | Gfral_pct | Adcyap1_pct | Both_pct |
|---|---|---|---|
| OPCs | 0.429 | 9.490 | 0.035 |
| GABA2 | 0.326 | 2.839 | 0.036 |
| GABA1 | 0.537 | 3.127 | 0.044 |
| astrocytes | 0.197 | 4.691 | 0.017 |
| Glu8 | 1.239 | 7.503 | 0.046 |
| GABA4 | 0.354 | 2.545 | 0.000 |
| Glu6 | 0.393 | 12.072 | 0.112 |
| Glu9 | 0.258 | 7.881 | 0.043 |
| chat3 | 0.181 | 49.819 | 0.000 |
| Glu3 | 0.478 | 14.225 | 0.053 |
| Glu13 | 1.346 | 7.570 | 0.120 |
| GABA5 | 0.698 | 2.243 | 0.050 |
| tancyte-like_cells | 0.395 | 3.162 | 0.000 |
| Glu1 | 0.352 | 13.092 | 0.070 |
| oligodendrocytes | 0.351 | 4.642 | 0.013 |
| Glu5 | 7.620 | 12.369 | 0.718 |
| Glu10 | 0.296 | 7.469 | 0.000 |
| GABA6 | 0.606 | 4.178 | 0.067 |
| GABA3 | 0.411 | 2.259 | 0.000 |
| Glu4 | 15.925 | 6.338 | 1.152 |
| Glu14 | 0.783 | 7.718 | 0.000 |
| Glu7 | 0.644 | 12.409 | 0.086 |
| GABA7 | 0.471 | 2.353 | 0.000 |
| VLMCs | 0.499 | 4.988 | 0.000 |
| chat1 | 0.306 | 31.885 | 0.245 |
| endothelial_cells | 0.145 | 1.447 | 0.000 |
| chat2 | 0.299 | 0.149 | 0.000 |
| Glu2 | 0.827 | 7.877 | 0.039 |
| Glu11 | 0.926 | 4.938 | 0.000 |
| ependymal_cells | 0.163 | 3.533 | 0.000 |
| microglia | 0.484 | 3.632 | 0.000 |
| Glu15 | 0.355 | 4.610 | 0.000 |
| Glu12 | 2.857 | 7.143 | 0.000 |
Gfral is mostly expressed in Glu4 and Glu5, whereas Adcyap1 show expression in multiple clusters, most notably chat1, chat3 and several glutamatergic cell types (Glu1, Glu3, Glu5, Glu6 and Glu7). In the Glu4 cluster, 6.3% of cells express Adcyap1.
Figure 3.1: Co-expression patterns for Gfral and Adcyap1 in neurons.
I used scLink to infer functional gene co-expression networks for Gfral and Adcyap1, and also including genes shown to have been co-expressed with these genes. Looking at the co-expression values for Gfral/Adcyap1 in all neurons, the results indicate a high level of co-expression between the two (scLink’s correlation = 0.82). Looking closer at the co-expression values per cluster type revealed that the co-expression is limited to the Glu4 cells (scLink’s correlation = 0.84).
Figure 3.2: Heatmap of co-expression correlation values for Gfral/Adcyap1 per cluster from scLink
A closer look at the Glu4 cluster show the co-expression with Adcyap1 as well as confirm that Gfral is co-expressed with Glp1r as expected (reported by Ludwig et al.).
Figure 3.3: Heatmap of co-expression correlation values in the Glu4 cluster from scLink
Percentage of Glu4 cells that are also annotated as AP is 25.7 %. This is higher compared to clusters which we expect to be NTS, for instance for the neighboring Glu5 cluster the value of AP cells is just around 1 %.
Gfral is a lowly expressed gene and this presents challenges, including increased noise and variability in the data, which may obscure true biological signals. The correlation between Gfral and Adcyap1 may be less reliable due to the limited number of expressing cells, potentially leading to biased interpretations. Despite these challenges, focusing on the Glu4 cluster where 16 % of cells expressed Gfral allows for a more meaningful analysis of Gfral’s role, emphasizing the need for careful consideration of expression patterns and statistical methods to accurately assess biological relevance. Also since many samples appeared to be under-sequenced, it may be that we’re not capturing Gfral expression as good as we could with deep sequencing.
There is a clear problem with expression analysis in scRNA-seq caused by sparse data and over-dispersion of counts, leading to conventional methods like Pearson and Spearman’s correlation coefficients not providing an efficient approach to representing and interpreting gene associations. The field is advancing rapidly and there are now multiple R-packages available which are made to account for the problems with scRNA-seq data. I proceeded with using scLink, a library designed for scRNA-seq data which calculate the correlation between gene pairs and then use a method to learn sparse dependencies between genes to construct sparse gene co-expression networks. It would be useful to test other packages as well to see if the results can be validated.
The cell type annotation didn’t yield the prettiest clusters the world has seen. I noticed that in some clusters there’s a lack of clear bimodality for the corresponding label, which indicate that its gene set is not sufficiently informative. The advantage of the AUCell approach is that it does not require reference expression values, which is useful when dealing with gene sets derived from the literature like I did here. The disadvantage is that information on relative expression is lost when only the marker identities are used. The net effect of ignoring expression values is difficult to predict; for example, it may reduce performance for resolving more subtle cell types, but may also improve performance if the per-cell expression was too noisy to be useful. Still, I think reference data with gene expression information could have helped.
Gfral is mostly expressed in glutamatergic neurons here annotated as Glu4. Adcyap1 show expression in the same cluster but is at the same time more generally expressed in other cell types. The co-expression analysis with scLink indicate that Gfral and Adcyap1 are highly co-expressed. Looking closer at the co-expression of these two genes per cell type reveals that this significant co-expression is limited to Glu4 cells (scLink’s correlation = 0.84). The Glu4 cells are located in the AP according to Ludwig et al., and here we see an enrichment for AP cells in the Glu4 cluster. Some caution is warranted in interpreting these results due low expression levels. Lowly expressed genes can still be biologically significant, but interpreting their role requires careful consideration of their expression patterns.
All code is available on GitHub at BDC-research-projects/P24-203.
Articles, books and tutorials I’ve used for this project:
Ludwig, M.Q., Cheng, W., Gordian, D. et al. A genetic map of the mouse dorsal vagal complex and its role in obesity. Nat Metab 3, 530–545 (2021). https://doi.org/10.1038/s42255-021-00363-1
Ilanges, A., Shiao, R., Shaked, J. et al. Brainstem ADCYAP1+ neurons control multiple aspects of sickness behaviour. Nature 609, 761–771 (2022). https://doi.org/10.1038/s41586-022-05161-7
Allen Reference Atlas – Mouse Brain [brain atlas]. Available from atlas.brain-map.org.